{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../../img/ods_stickers.jpg\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 线性回归和随机梯度下降"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "本次挑战中，你需要使用随机梯度下降方法来完成线性回归问题。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先，我们导入挑战可能需要到的模块："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "from matplotlib import pyplot as plt\n",
    "from sklearn.base import BaseEstimator\n",
    "from sklearn.metrics import log_loss, mean_squared_error, roc_auc_score\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from tqdm import tqdm\n",
    "\n",
    "%matplotlib inline\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面，加载挑战所使用的示例数据集。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_demo = pd.read_csv(\"../../data/weights_heights.csv\")\n",
    "data_demo.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "每个样本数据包含 2 个特征，我们绘制二维散点图。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.scatter(data_demo[\"Weight\"], data_demo[\"Height\"])\n",
    "plt.xlabel(\"Weight (lbs)\")\n",
    "plt.ylabel(\"Height (Inch)\")\n",
    "plt.grid()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来，设定 `Weight` 为 `X`，`Height` 为 `y`。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X, y = data_demo[\"Weight\"].values, data_demo[\"Height\"].values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后将数据集切分为训练和验证数据，并进行规范化。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_valid, y_train, y_valid = train_test_split(\n",
    "    X, y, test_size=0.3, random_state=17\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scaler = StandardScaler()\n",
    "X_train_scaled = scaler.fit_transform(X_train.reshape([-1, 1]))\n",
    "X_valid_scaled = scaler.transform(X_valid.reshape([-1, 1]))\n",
    "\n",
    "X_train_scaled.shape, X_valid_scaled.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们同样可以将规范化之后的训练数据绘制成散点图。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.scatter(X_train_scaled, y_train)\n",
    "plt.xlabel(\"Weight (lbs)\")\n",
    "plt.ylabel(\"Height (Inch)\")\n",
    "plt.grid()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来，你需要实现一个使用随机梯度下降方法的线性回归类，并使其可以完成训练和测试的过程。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i class=\"fa fa-question-circle\" aria-hidden=\"true\"> 问题：</i>按下面的要求实现随机梯度下降线性回归类。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 类名为 `SGDRegressor`，其继承自 `sklearn.base.BaseEstimator`。\n",
    "- 构造函数接受参数 `eta` 学习率（默认为 $10^{-3}$）和 `n_epochs` 全数据集迭代次数（默认为 3）。\n",
    "- 构造函数创建 `mse_` 和 `weights_` 列表，以便在梯度下降迭代期间追踪均方误差和权重向量。\n",
    "- 该类需包含 `fit` 和 `predict` 方法用于训练和预测。\n",
    "- `fit` 方法可接受矩阵 `X` 和向量 `y`（`numpy.array` 对象）作为参数。该方法可自动在 `X` 左侧追加一列全为 `1` 的值作为截距项系数，权重 `w` 则统一用零初始化。然后进行 `n_epochs` 权重更新迭代，并将每次迭代后的均方误差 MSE 和权重向量记录在预先初始化的空列表中。\n",
    "- `fit` 方法返回 `SGDRegressor` 类的当前实例，即 `self`。\n",
    "- `predict` 方法可接受矩阵 `X` 矩阵，同样需支持自动在 `X` 左侧追加一列全为 `1` 的值作为截距项系数，并使用由 `fit` 方法得到的权重向量 `w_` 计算后返回预测向量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来，我们实例化 `SGDRegressor` 类，并传入训练数据："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sgd_reg = SGDRegressor()\n",
    "sgd_reg.fit(X_train_scaled, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果一切正常的话，我们可以经由以下代码输出迭代过程中 MSE 的变化曲线："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(range(len(sgd_reg.mse_)), sgd_reg.mse_)\n",
    "plt.xlabel(\"#updates\")\n",
    "plt.ylabel(\"MSE\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "打印出 MSE 的最小值，以及最终的权重系数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.min(sgd_reg.mse_), sgd_reg.w_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "绘制出 $w_0$ 和 $w_1$ 在训练过程中的变化曲线。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.subplot(121)\n",
    "plt.plot(range(len(sgd_reg.weights_)), [w[0] for w in sgd_reg.weights_])\n",
    "plt.subplot(122)\n",
    "plt.plot(range(len(sgd_reg.weights_)), [w[1] for w in sgd_reg.weights_])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最后，使用 `(X_valid_scaled, y_valid)` 作出预测，并计算验证集上的 MSE 值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sgd_holdout_mse = mean_squared_error(y_valid, sgd_reg.predict(X_valid_scaled))\n",
    "sgd_holdout_mse"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最后，我们通过一个单元测试来保证 `SGDRegressor` 已经正常工作。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里使用 `sklearn.linear_model` 提供的 `LinearRegression` 来计算普通最小二乘法得到在验证集上的 MSE 值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "lm = LinearRegression().fit(X_train_scaled, y_train)\n",
    "print(lm.coef_, lm.intercept_)\n",
    "linreg_holdout_mse = mean_squared_error(y_valid, lm.predict(X_valid_scaled))\n",
    "linreg_holdout_mse"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果 `SGDRegressor` 得到的结果和 `SGDRegressor` 得到的结果在 $10^{-4}$ 范围之内，我们认为测试通过。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    assert (sgd_holdout_mse - linreg_holdout_mse) < 1e-4\n",
    "    print(\"Correct!\")\n",
    "except AssertionError:\n",
    "    print(\n",
    "        \"Something's not good.\\n Linreg's holdout MSE: {}\"\n",
    "        \"\\n SGD's holdout MSE: {}\".format(linreg_holdout_mse, sgd_holdout_mse)\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color: #e6e6e6; margin-bottom: 10px; padding: 1%; border: 1px solid #ccc; border-radius: 6px;text-align: center;\"><a href=\"https://nbviewer.jupyter.org/github/shiyanlou/mlcourse-answers/tree/master/\" title=\"挑战参考答案\"><i class=\"fa fa-file-code-o\" aria-hidden=\"true\"> 查看挑战参考答案</i></a></div>"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
